-
Notifications
You must be signed in to change notification settings - Fork 919
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Return empty result for segmented_reduce if input and offsets are both empty #17437
Conversation
if (segmented_values.is_empty() && offsets.empty()) { | ||
return cudf::make_empty_column(output_dtype); | ||
} | ||
|
||
CUDF_EXPECTS(offsets.size() > 0, "`offsets` should have at least 1 element."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the early exit is not taken here, we probably need at least two elements in the offsets for the segments to be valid.
CUDF_EXPECTS(offsets.size() > 0, "`offsets` should have at least 1 element."); | |
CUDF_EXPECTS(offsets.size() > 1, "`offsets` should have at least 2 elements."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there is a specific test for this case so I'm going to leave this line unchanged.
https://github.com/rapidsai/cudf/pull/17437/files#diff-274951d8b19a7c6bf16db78ac17d129ba021bc5135f3d3acfbb1bc814b37ee14R1046
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we did this? It seems like it would be safer and that existing test would still pass.
- Return an empty result column if values OR offsets are empty (currently the conjunction is AND)
- If values are non-empty, require at least two offsets (otherwise indexing the values is impossible)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current behavior requires a non-empty result when the input is empty and the offsets are not empty. This appears to be from a request from Spark (reference: #10556 (comment)).
It seems like a non-empty input with more than one offset is reasonable though may not be necessary to check explicitly. I can look into this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, of course. It is possible to have empty values and non-empty offsets. Sorry, that was an oversight on my part.
Thanks for considering the non-empty input requiring >=2 offsets. It would be nice to make this stricter if we can make a good case for its correctness, but I don’t mean to block on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess. It seems like it should be an invalid input but I don’t want us to go in circles here. A test exists so let’s maintain the status quo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the correct behaviour would be to return one element, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is only one offset, it cannot represent a range. Normally we have N+1 offsets to represent N segments because the i
-th offset is the (inclusive) start of range i
and the (exclusive) end of range i-1
.
If we have 1 offset we have at most zero segments, so the test returns an empty result. My claim is that this may be invalid input, because 1 offset cannot define a range, but the user somehow got a single offset. The legal inputs under my previous proposal (which I no longer want to push for) would be zero/empty offsets for zero segments, or >=2 offsets for representing >=1 segment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the correct behaviour would be to return one element, right?
I'm not sure what you are referring to. I believe the testcases cover the expected outcomes. This PR only changes the specific behavior where both the input and offsets are empty and returns an empty result instead of throwing an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The legal inputs under my previous proposal (which I no longer want to push for) would be zero/empty offsets for zero segments, or >=2 offsets for representing >=1 segment.
This certainly would be a breaking change we could consider in a separate PR.
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
/ok to test |
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this 🔥
/merge |
Description
Changes the behavior of
cudf::segmented_reduce
to return an empty column if both the input and the offsets parameter are empty.Closes #17433
Checklist